ITP 350 Final Project

Evan Celaya

Racquel Fygenson

4/22/18

The Question:

Is the number of crimes at a New York City school correlated with its students' scores on the State Mathematics Exam?

Data Overview

  • Safety data set of 4,241 observations for 33 variables
  • Math data set of 161,403 observations for 17 variables
  • Observations in both data sets are schools (K-12) in New York City

Source: Local NYC Government: https://catalog.data.gov/dataset/new-york-state-mathematics-exam-by-school https://catalog.data.gov/dataset/school-safety-report-8067a

Cleaning the Data: Choosing Variables

  • Restricted the datasets to the years 2014 and 2015 only
  • Primarily interested in the relationship between each school's mean State Math Exam score and the number of crimes that occurred near that school in the same year
  • Retained information about location (borough, school district, longitude, latitude)
  • Retained information about type of crime
  • Removed irrelevant columns
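A minimal sketch of the year filter and column selection described above, in base R. The toy data and the column names (`DBN`, `Year`, `MeanScaleScore`, `BuildingName`) are assumptions for illustration, loosely modeled on the cleaned data frame previewed in this deck; this is not the original cleaning code.

```r
# Toy stand-in for the raw math data (values and column names are hypothetical).
math_raw <- data.frame(
  DBN            = c("01M015", "01M015", "01M019", "01M019"),
  Year           = c(2013, 2014, 2014, 2015),
  MeanScaleScore = c(280, 278, 308, 310),
  BuildingName   = c("A", "A", "B", "B"),  # hypothetical column we do not need
  stringsAsFactors = FALSE
)

keep_years <- c(2014, 2015)                   # years retained in the project
keep_cols  <- c("DBN", "Year", "MeanScaleScore")

# Keep only rows from the chosen years and only the columns of interest.
math_small <- math_raw[math_raw$Year %in% keep_years, keep_cols]
rownames(math_small) <- NULL

print(math_small)
```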

Cleaning the Data: Initial Look

Math Exam Scores

  • The data was very clean; we only deleted columns of no use to us (e.g., Building Name within School)

School Safety

  • This data was complicated to clean
  • The dataset is organized by location, not by school name
    • Some locations are shared by multiple schools, so whenever a crime occurs at one of those locations, it affects ALL schools at that location

Cleaning the Data: School Safety Reports

The Safety data set was full of “N/A” and “# N/A” values wherever schools shared a location. It looked a little like this:

Image 1

Next, we separated the table into 2 data frames: consolidated locations (yellow) and school names (purple). Then we separated each data frame by year (2014 and 2015).

Cleaning the Data: School Safety Reports

Next we joined the consolidated data for 2014 (yellow) with the correct school names for 2014 (purple), and did the same for 2015 (yellow & purple) to look like this:

Image 2

Cleaning the Data: School Safety Reports

Next we had to copy the consolidated data’s (yellow) crime rates over to the individual school names they corresponded to (purple) to look like this:

Image 3
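One way to copy location-level values onto every school at that location is a keyed merge. The sketch below assumes a shared `LocationCode` column and a `TotalCrimes` column; both names, the second school code, and all values are hypothetical, and the original project may have done this step differently.

```r
# Hypothetical location-level rows (the "yellow" consolidated data).
locations <- data.frame(
  LocationCode = c("M015", "M019"),
  TotalCrimes  = c(4, 1),
  stringsAsFactors = FALSE
)

# Hypothetical school-level rows (the "purple" names); two schools share M015.
schools <- data.frame(
  DBN          = c("01M015", "01M363", "01M019"),
  LocationCode = c("M015", "M015", "M019"),
  stringsAsFactors = FALSE
)

# Left join: every school receives the crime count of its location,
# so a crime at a shared location is attributed to ALL schools there.
schools_with_crime <- merge(schools, locations, by = "LocationCode", all.x = TRUE)

print(schools_with_crime)
```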

Then, we used rbind to connect the 2014 and 2015 data back together, deleting duplicate rows.
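The stacking and de-duplication step can be sketched with `rbind` and `duplicated`. The two per-year frames here are toy data with assumed column names, not the project's actual tables.

```r
# Hypothetical per-year safety frames; 2014 contains one exact duplicate row.
safety_2014 <- data.frame(
  DBN         = c("01M015", "01M019", "01M019"),
  Year        = 2014,
  TotalCrimes = c(4, 1, 1),
  stringsAsFactors = FALSE
)
safety_2015 <- data.frame(
  DBN         = c("01M015", "01M019"),
  Year        = 2015,
  TotalCrimes = c(2, 0),
  stringsAsFactors = FALSE
)

# Stack the two years (columns must match), then drop exact duplicate rows.
safety_all <- rbind(safety_2014, safety_2015)
safety_all <- safety_all[!duplicated(safety_all), ]
rownames(safety_all) <- NULL

print(safety_all)
```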

Final Cleaned & Massaged Data Frame

In total, we reduced the dimensions of the Safety data set to 3,565 observations of 29 variables, and the dimensions of the Math data set to 2,250 observations of 10 variables, merging the two data sets into a final data frame of 2,250 observations of 42 variables. We use this data frame for our data analysis and graphing. Below is a quick view of the first 6 variables and 6 observations.

  X    DBN                    School.Name Year NumTested MeanScaleScore
1 1 01M015      P.S. 015 ROBERTO CLEMENTE 2014        63            278
2 2 01M019            P.S. 019 ASHER LEVY 2014       104            308
3 3 01M020           P.S. 020 ANNA SILVER 2014       233            298
4 4 01M034 P.S. 034 FRANKLIN D. ROOSEVELT 2014       264            298
5 5 01M063      THE STAR ACADEMY - P.S.63 2014        53            301
6 6 01M064          P.S. 064 ROBERT SIMON 2014       113            294
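The merge of the two cleaned data sets might look like the sketch below. It assumes the join keys are `DBN` and `Year` and that every math row is kept (which would be consistent with the final frame having the same number of observations as the Math data set); the `TotalCrimes` column and its values are hypothetical.

```r
# Three math rows taken from the preview above.
math <- data.frame(
  DBN            = c("01M015", "01M019", "01M020"),
  Year           = c(2014, 2014, 2014),
  MeanScaleScore = c(278, 308, 298),
  stringsAsFactors = FALSE
)

# Hypothetical safety rows; 01M020 is deliberately missing.
safety <- data.frame(
  DBN         = c("01M015", "01M019"),
  Year        = c(2014, 2014),
  TotalCrimes = c(4, 1),
  stringsAsFactors = FALSE
)

# Left join on school ID and year: keep every math row,
# filling NA where no safety record matches.
combined <- merge(math, safety, by = c("DBN", "Year"), all.x = TRUE)

print(combined)
```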

Summary Statistics: Distribution of Crime by Type

plot of chunk Crime Type Boxplots 1

Summary Statistics: Distribution of Crime by Type

plot of chunk Crime Type Boxplots 2

Summary Statistics: Score Distributions by Crime

plot of chunk Histograms Low & High Crime

Visualizing Relationships: Crime vs. Number of Schools per Location by Year

plot of chunk Linear Models 14 & 15

Visualizing Relationships: Crime vs. Score by Year

plot of chunk Math vs Total Crime
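The relationship in a crime-versus-score plot can also be summarized numerically. Below is a small sketch using Pearson correlation and a simple linear fit on made-up numbers; the vectors `crimes` and `score` are illustrative only and do not reproduce the project's results.

```r
# Made-up crime counts and mean scale scores for six schools.
crimes <- c(0, 1, 2, 3, 5, 8)
score  <- c(310, 305, 300, 296, 290, 282)

# Pearson correlation between crime count and mean score.
r <- cor(crimes, score)

# Simple linear regression of score on crime count.
fit <- lm(score ~ crimes)

print(r)
print(coef(fit))
```

On real data, `cor.test(crimes, score)` would additionally report a p-value and confidence interval for the correlation.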

Visualizing Relationships: Spatial Data

plot of chunk Spatial2

Learning Models: Regression Tree

plot of chunk Regression Tree

Learning Models: k-Means Clustering

plot of chunk k-Means

Learning Models: k-Means Clustering

plot of chunk k-Means 2
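For reference, a minimal k-means sketch in base R. The features (`TotalCrimes` and `MeanScaleScore`), the choice of 2 clusters, and the data are all assumptions for illustration; the deck does not state which variables or how many clusters were used.

```r
set.seed(350)  # k-means starts from random centers, so fix the seed

# Made-up schools: four low-crime/high-score, four high-crime/low-score.
x <- data.frame(
  TotalCrimes    = c(0, 1, 1, 2, 7, 8, 9, 10),
  MeanScaleScore = c(312, 308, 305, 301, 285, 283, 280, 276)
)

# Standardize so both variables contribute on a comparable scale,
# then cluster with several random restarts.
km <- kmeans(scale(x), centers = 2, nstart = 10)

print(table(km$cluster))
```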

Conclusion

Overall, the number of crimes at a school in New York City was associated with the scores its students received on the State Mathematics Exam. Visual analysis and regression modeling both show a distinct difference in predicted mean math exam score between schools with fewer than 3 crimes and schools with 3 or more.
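The comparison at the 3-crime threshold can be sketched as a simple group mean. The data below is invented and the column names are assumptions; only the threshold of 3 comes from the conclusion above.

```r
# Made-up schools with a crime count and a mean scale score.
combined <- data.frame(
  TotalCrimes    = c(0, 1, 2, 3, 5, 8),
  MeanScaleScore = c(310, 305, 300, 296, 290, 282)
)

# Split schools at the threshold named in the conclusion.
combined$CrimeGroup <- ifelse(combined$TotalCrimes < 3, "fewer than 3", "3 or more")

# Mean score within each group.
group_means <- tapply(combined$MeanScaleScore, combined$CrimeGroup, mean)

print(group_means)
```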